Parrotpark: Why and How to self-host LLMs
Jonas Stettner | CorrelAid @ CDL
2025-05-07
Agenda
What is Parrotpark?
Why self-hosting?
How to self-host?
What to self-host?
Parrotpark Architecture
Evaluation
Demonstration
Discussion
What is Parrotpark? 🦜
Cooperation with D64 in the context of the project “Code of Conduct: Democratic AI”
Experimental infrastructure project for self-hosting LLMs and accompanying applications
Targeted at work in civil society organizations
Project is finished and ended with an evaluation
Why self-hosting?
Digital Sovereignty
Dependence, lack of transparency and little control:
Data processing (GDPR)
Resource consumption
Properties and training of the models
Model and tool usage/configuration; e.g. web search (🗲 GUI apps such as GPT Builder)
How to self-host: LLM hosting options
LLM inference:
Azure OpenAI on EU servers - ✅ GDPR
Open models
- ✅ More transparent model characteristics
API services hosted in the EU
Dedicated GPU server - ✅ Fully transparent resource consumption (only inference)
Dedicated vs API: Costs for EU Provider Scaleway
Claude 4 Opus on Open Router: $15/M input tokens; $75/M output tokens
Dedicated vs API: Costs for EU Provider Scaleway
How much VRAM can we afford?:
L4
with 24GB, limits model choices + context window size
Dedicated vs API: Example Pricing Calculation
Scaleways allows automated GPU instance creation (unlike Hetzner), so we deploy only during working hours
\(\text{Cost} = \text{€}0.75 \times (10\,\text{h} \times 5\,\text{days} \times 4\,\text{weeks}) = \text{€}150\)
Including tax (19%):
€178.50
Compared to Mistral Small 3.2 24B via OpenRouter (assuming 50/50 input/output split):
€178.50 = $210.27 (at €1 = $1.178)
\(\frac{\text{\$}105.14}{0.05} + \frac{\text{\$}105.14}{0.10}\)
=
3,154M tokens
Per working day:
\(\frac{3{,}154}{20} = 158\,\text{M tokens/day}\)
Dedicated vs API: Why Dedicated?
Maximum control and transparency
More predictable/fixed costs
One can fit other services on the GPU server
Exact metrics on hardware and inference server level
What to self host? - Which LLM?
Model size, context and concurrency:
Infrastructure choice limits model options
Pre-quantised models on Hugging Face
Which language and task is the model used for?
“When a measure becomes a target, it ceases to be a good measure.”
For automated tests: LLM as a judge
Finetuning vs. Prompt Engineering
What to self host? - Other required services
Inference Server: Ollama, vLLM, Hugging Face TGI etc.
LiteLLM: Control over the API
Chat Interface: LibreChat, Open WebUI
Databases (Vector, Application) and File Storage
Metrics
Auth
MCP/Stuff for Tools, e.g. web search
Parrotpark Architecture: High Level Overview
]
Parrotpark Architecture: Misc. Details
Implementation as Infrastructure as Code (Terraform and Ansible)
Nested (Entrance Server periodically runs IaC)
Metrics scraped with Telegraf and sent to external TimeScale Database, connected to Metabase
Evaluation
Time window: June 17th to June 27th (9 working days)
Scraped Metrics: http://mtbs.correlaid.org/public/dashboard/6032e4e9-e87a-49d7-bd67-f0d92552cc1c
User Survey
Evaluation: Tokens and Pricing
Total processed tokens:
329,503
input /
103,083
output
❌ API service for the same model would have cost waaaaay less:
$0.027
vs ~(€178.5/2)=€89
Demonstration